Performance enhancement of high order Hahn polynomials using multithreading

Orthogonal polynomials and their moments have significant role in image processing and computer vision field. One of the polynomials is discrete Hahn polynomials (DHaPs), which are used for compression, and feature extraction. However, when the moment order becomes high, they suffer from numerical instability. This paper proposes a fast approach for computing the high orders DHaPs. This work takes advantage of the multithread for the calculation of Hahn polynomials coefficients. To take advantage of the available processing capabilities, independent calculations are divided among threads. The research provides a distribution method to achieve a more balanced processing burden among the threads. The proposed methods are tested for various values of DHaPs parameters, sizes, and different values of threads. In comparison to the unthreaded situation, the results demonstrate an improvement in the processing time which increases as the polynomial size increases, reaching its maximum of 5.8 in the case of polynomial size and order of 8000 × 8000 (matrix size). Furthermore, the trend of continuously raising the number of threads to enhance performance is inconsistent and becomes invalid at some point when the performance improvement falls below the maximum. The number of threads that achieve the highest improvement differs according to the size, being in the range of 8 to 16 threads in 1000 × 1000 matrix size, whereas at 8000 × 8000 case it ranges from 32 to 160 threads.


Introduction
Moment theory is a powerful tool in the areas of image processing, pattern recognition, and computer vision applications [1].Signals are described using scalar values called moments (one, two, or more dimensions).A set of polynomial basis functions is utilized to compute moments.These basis functions are used to convert signals, voice or images, to the transform domain [2,3].To deal with the problem of pattern identification, Hu [4] introduced geometric moments and moments invariants.The proposed momemnts are not orthogonal, which result in numerical difficulties [5].
Continuous moments could be determined using continuous orthogonal polynomials like Zernike [6] or by using Tchebichef with altered radius [7].Continuous moment functions can be incorrect due to two common types of errors: image coordinate transformation and continuous integral approximation.[8].Due to the utilization of discretization and approximation throughout the process of image reconstruction, the obtained image will be imperfect [9].
In order to avoid the aforementioned constraints, Discrete Orthogonal Polynomials (DOPs) have been concentrated on by the researchers.This is due to their remarkable image reconstruction features [8][9][10].In addition, 1D and 2D signals can be represented by discrete orthogonal moments (DOMs) without redundancy.Also, they have great energy compaction, and spectrum resolution characteristics [11][12][13][14].Signal representation and feature extraction have recently been used to discrete Tchebichef polynomials [15,16], discrete Hahn moments [17], and discrete Krawtchouk moments [18,19].It is note worthy that the DOPs are utilized for solving linear functional differential equations [20].
DOPs' are solid owing to their significant features, which involve localization, energy compression, watermarking, signal extraction features, numerical stability, efficient data processing, and resilient data analysis [3,[21][22][23][24][25].At the same time, the vital characteristics of majority of DOMs are not applied to large-sized images which is due to limitation in the computation of polynomials [26].
The DOPs limitations such as overflow, the instability of the polynomial values, and the high computational complexity have resulted in this constraints.Therefore, an improved recurrence technique for generating higher orders are being improved, for example Tchebichef [16] and Krawtchouk [18] polynomials.Recently, Researchers have considered other DOPS, for example Charlier polynomials [27] and Hahn polynomials [26].
The computation of DOP coefficients and the propagation of errors have been simplified by using the recursive algorithms [28,29].Regarding degree n either a single or double recursive formula can be employed.It also considers the time or spatial coordinate.To overcome the instability in the numerical values, the DOP coefficients should be calculated in the direction of the variable n.However, when the size of one or two dimensional signals turns large these calculations become inefficient.Since small values are assumed for the squared norm of the scaled Tchebichef polynomials, the coefficients of Tchebichef polynomials, for example, suffer from instabilities in the numerical values.To solve the aforementioned problem, the recurrence method in the x-direction was introduced by Mukundan [9].After this study, this issue has received significant attention in many studies, for example, [28].
Generally, there has been a lot of focus on computation cost [30,31].It is considered as a key element which assist in ill-conditioning.For this reason, large number have considered it [32,33].This drawback is addressed in [34] by using a rapid and efficient computation method for Meixner moment coefficients.Another research introduced a fast and stable approach of Tchebichef moments for higher polynomial order, this is performed by combining the recurrence algorithms in the n and x directions [16].Daoui et al. [26] utilized Gram-Schmidt orthogonalization procedure to reduce numerical error propagation.However, this technique is relatively slow.
This paper proposes a novel technique inspired by discrete orthogonal Hahn moments.Based on the literature, the three-term recurrence algorithms have been utilized in several existing works to tackle the problem of computational cost and propagation error due to gamma and binomial functions [35].In [28], the n-direction recurrence algorithm was employed with an initial value starting at n, x = 0.The drawback of the algorithm presented in [28] (the recurrence algorithm in the n-direction) comes from starting the sets of initial values which are based on the initial value at n, x = 0.The initial sets are computable for a very restricted DHaP polynomials size and parameters.This results in a maximum computable polynomial size of 135.In other words, the limitation is due to the employed formula.In addition to the issue of high computational cost, algorithm used has numerical instability.To resolve the issue of the recurrence algorithm in the n-direction, the x-direction recurrence relation is adopted with a symmetry relation for equal values of the polynomials parameters [28].Using the symmetry relation allows a reduction of 50% in the computed coefficients, which reduces the computation cost.Yet the recurrence relation in the x-directions has two limitations.The first limitation is that, according to the nature of the formula being utilized, the initial set becomes 0 when samples size or parameter values tends to be large.The second limitation is that when the degree of the polynomial increases, the coefficient values underflow because the initial value are less than 10 −324 , which equals zero for various values of the polynomials' parameters.The highest possible order that can be calculated occurs at n = 1423.To overcome the limitation of previous recurrence algorithms, Daoui et al. [26] presented a technique based on the n-direction recurrence relation and the Gram-Schmidt orthonormalization process (GSOP).The utilization of the GSOP minimizes the numerical errors due to the use of the n-direction recurrence algorithm.However, the GSOP-based recurrence resolves the orthogonality of the DHaPs, but it has several limitations.First, the algorithm is unable to accurately calculate the coefficients of the DHaP when the parameters are not equal.Second, due to the technique used to calculate the initial values, the algorithm is still unable to generate DHaP for a wide range of DHaP parameters.Third, the nested loops of the GSOP algorithm result in a high computational cost, which in turn raises the number of processes required to compute the coefficients of the DHaP.Recently, a new mathematical model has been presented by [36], which can compute the DHaP's initial value for a wide range of DHaP parameters values.In order to stabilize the computation of the DHaP coefficients, the algorithm also consists of two recurrence algorithms with adaptive thresholds.Although the algorithm in [36] can compute the coefficients of the polynomials more accurately than other algorithms, it still suffers from computation overhead.
In this paper, a fast approach for computing the DHaPs is proposed and applied to high orders.This work takes advantage of the multithread for the computation of Hahn polynomials coefficients.To take advantage of the available processing capabilities, independent calculations are divided among threads.The research provides a distribution method to achieve a more balanced processing burden among the threads.
This paper is organised as follows: in Section "Mathematical definition of DHaP and its moments" Preliminaries and current three-term recurrence algorithms are discussed.The proposed recurrence algorithm is presented in Section "Proposed Recurrence Algorithm".In Section "Experimental Results", the proposed recurrence method is evaluated by an experimental investigation.Finally, the paper is concluded in Section "Conclusion".

Mathematical definition of DHaP and its moments
This section presents the mathematical principles of the DHaP and their moments.

The definition of DHaPs
The nth order of the DHaP is defined as [28]: where 3 F 2 (�) is the generalised hypergeometric series denoted by: and (�) k is the Pochhammer symbol [37].
The orthogonality of the DHaPs is satisfied as follows: where ρ demotes the norm function of DHaP, ω represents the weight function of the DHaP, and δ nm is the Kronecker delta.The norm and weight functions of the DHaP are defined as follows: The nth degree of the weighted and normalized DHaP is given by The definition of DHaM DHaMs are the projection of the signals on the basis of the DHaP.Suppose a 2D signal f(x, y) of size N 1 × N 2 .Then, the DHaMs, C nm , can be calculated: To reconstruct back the 2D signal, image, to the spatial domain, the reconstructed signal f ðx; yÞ can be computed as follows:

Related work
It is well known that hypergeometric series defined in Eq (1) is computationally cost and shows imprecise precision of the polynomials coefficients; thus, the three term recurrence relations are used.The available recurrence relations with their analysis are discussed in this section.

The recurrence relation in the n-direction (RRnd).
The nth degree of the DHaP at the xth index is defined as follows [28] Ĥa;b n ¼ 2; 3; . . .; N À 1; and the recurrence relation parameters are defined by: ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffiffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffiffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi oðxÞ rð1Þ The problem of the recurrence relation in the n-direction recurrence algorithm is due to the utilized initial values Ĥa;b 0 ðx; NÞ and Ĥa;b 1 ðx; NÞ.These initial values bound the polynomial to low values of polynomial size N, where the largest size can be obtained is 135 samples which occurs at limited range of DHaP parameters, α = 20 and β = 20.This limitation occurs because of the utilized formulas.To resolve this problem, the complexity of the utilized formulas can be reduced; however, the recurrence relation in the n-direction still shows numerical propagation error [36].

The recurrence relation in the x-direction (RRxd).
To compute the DHaPs at the xth index with nth degree, the following recurrence is used [28]: x ¼ 2; 3; . . .; N À 1; and The recurrence relation coefficients η 1 , η 2 , and η 3 are computed as follows: ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi oðxÞ À lðnÞ ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi oðx À 1Þ ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi oðx À 2Þ ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi oð1Þ oð0Þ It is noteworthy that for the recurrence relation in the x-direction, symmetry relation [26] is employed to recude the computation cost: The utilization of the symmetry relation Eq (20) will reduce the computed coefficients to 50%.However, the recurrence relation in the x-direction still has two limitations.These limitations are: 1.The values of the Ĥa;b n ð0; NÞ tend to zero as the number of samples (N) increases and the values of the DHaP parameters becomes big.This is due to the formula used in Eq (18), and 2. The coefficient values of the DHaPs become underflowed as the polynomial degree (n) becomes large.This is due to the values of the initial becomes less than 10 −324 , which in turn goes to zero in different environments [36].
It is noteworthy that the DHaPCs become zero as the polynomial degree increases.For instance, the maximum non-zero coefficients occurred at n = 1423 when N = 1600 and α = β = 10 [36].
Recurrence relation based on Gram-Schmidt orthonormalization process (RRGSOP).Daoui et al. [26] introduced their algorithm which is based on Gram-Schmidt orthonormalization process (GSOP), as well as n-direction recurrence relation to determine DHaP.In this case, GSOP is introduced to solve the issue instability in the DHaPCs.The proposed computes the initial sets Ĥa;b 0 ðx; NÞ and Ĥa;b 1 ðx; NÞ.Recurrence relation in the n-direction is utilised to find the coefficients for n > 1 is then applied in order to reduce computational errors produced through the n-direction recurrence algorithm.Despite the fact that GSOP-based recurrence algorithm fulfil orthogonality condition, it has the following problems: 1.When α 6 ¼ β, the algorithm fails to correctly identify the DHaP coefficients [26].
2. This algorithm is incapable to provide DHaP for a large range of values for α and β.
3. GSOP-based recurrence algorithm suffers from computational complexity as a result of the nested loops utilized to reduce the error for each polynomial degree.This can increase the number of operations required to calculate DHaP coefficients.
Hybrid recurrence algorithm.The authors in [36] provided a new algorithm for the purpose of solving the issues and limitations from previous algorithm.In this case the authors looked at two recurrence algorithms (which are the n-and x-recurrence relations) as well as adaptive threshold in order to be able to stabilize the generation of the DHaP coefficients.It should be noted that computational cost is considered important factor which is increased in this algorithm.

Proposed recurrence algorithm
In this section, the proposed multi-thread recurrence algorithm for DHaPs is presented in details.
The matrix of the DHaPs is divided into four parts (Part H1, Part H2, Part H3, and Part H4) similar to the partitions made in [36] (see Fig 1).First, computation of the first two columns at x = 0 and 1 are performed before the computation of the coefficients at part H1 (see Fig 2).Then, the proposed algorithm distributes the rows among a set of threads rather than the sequential processing of the H1 rows, which permits parallel processing of the coefficients.As every row is computed independently of the other rows.Part H2 is similarly divided among the same number of threads as part H1.Following the calculation of the last two columns, the threads start calculating the coefficient.
Furthermore, parts 3 and 4 (H3 and H4), begin respectively after parts H1 and H2 have been completed.The coefficients of the last two rows in part H1 are required to calculate part H3, however each column is independent of the others.Consequently, to perform parallel computation of the coefficients, the columns in part H3 are distributed among a set of threads.In the same way, after calculating part H2 coefficient, part H4 columns are processed using a set of threads.
For the case where α = β, there is a mirroring property that can be used to reduce the processing load.In the proposed algorithm, parts H2 and H4 coefficients are calculated inline with the part H1 and H3 calculations respectively.Using the available parallel resources, which are represented by multicores.The number of threads assigned to each scenario guarantees that parts H1 and H2 have the same processing burden.In parts H1 and H2 for 2N/5 rows, the number of rows for each thread will be bunch1 = 2N/5/Th; Th is the number of threads.The number of columns assigned for each thread in parts H3 and H4 will be bunch2 = (N/2)/Th.

Experimental results
In order to evaluate the proposed threaded algorithm different matrix sizes are considered, namely 1000 × 1000, 2000 × 2000, 4000 × 4000, and 8000 × 8000.The unthreaded case in [36] is compared to some works in literature TRx [28], TRn [28], and TRGSOP [26] for the cases where α = β and a 6 ¼ b to evaluate their performance.Tables 1 and 2 show the normalized performance of the different algorithms as compared to the unthreaded case [36].For the case of α = β, the unthreaded case outperforms the other algorithms in all cases, except the TRn at 1000 matrix size.On the other hand, TRx outperforms the unthreaded case when α 6 ¼ β for all matrix sizes, whereas TRn falls behind the unthreaded case at Large matrix sizes (4000 and 8000).TRGSOP show the worst performance at both α = β and α 6 ¼ β in all matrix sizes.It should be noted that TRn, TRx, and TRGSOP are unstable in all the tested cases while the unthreaded algorithm is stable.The proposed threaded algorithm is evaluated and compared with the unthreaded case at the same cases of matrix sizes considered previously with different combinations of parameters α and β.In addition, different combinations of parameters α and β are considered as well.The proposed algorithm is based on the distribution of the computation load among independent threads to provide parallelism, thus enhancing performance.The effect of number of threads on the performance is evaluated by considering 2 threads up to 250 threads.
Starting with the cases where α = β = 10, 50, 100, 150, 200, and 250, Fig 3 shows the performance improvement achieved by the proposed algorithm with respect to the unthreaded case.It can be noticed that the improvement has been slightly affected by the changes in the parameter values α = β, this is applicable at different number of threads and different matrix sizes.The highest difference can be observed at size = 1000 for α = β = 10 where it achieves up to 23% lower improvement as compared to the other parameter cases.
As can be noticed in Fig 3, all cases achieve about 100% improvement when the process is divided among two threads.This improvement linearly increases with the increase in the number of threads up to 8 threads.The threading technique has higher improvement as compared to the unthreaded case at higher matrix size, as it achieves up to 4.7, 5.5, 5.6, and 5.8 improvement at 1000, 2000, 4000, and 8000 matrix sizes, respectively.At the size of 1000, the improvement reaches its maximum value at number of threads ranging from 8 to 16.As the number of threads increases over the aforementioned range the improvement gradually declines and the performance worsens as compared to the unthreaded case when the number of threads exceeds 128.This can be linked to the exceeding overhead imposed by the high number of threads as compared to the useful processing carried out by each thread.This drop in improvement can also be noticed in the case of size = 2000, but at a slower slope and at worst case the improvement does not fall below 1.4.For the size of 4000 and 8000, as the number of threads is increased above 8 the improvement is maintained above 4.5 and 5 respectively.
Higher CPU utilization may imply an improved performance as more processing power is allocated to the algorithm which leads to reduced execution time.To some extent this relation is valid as can be seen in Fig 4 .In the unthreaded case, the CPU utilization is limited to 13% which results in high delay.This CPU utilization increases as the number of threads increases and results in lower delays.This relation does not persist for matrix sizes 1000 and 2000, where the delay increases again when the number of threads exceed 16 and 32, respectively.This results in increased processing power with no performance gain which should be avoided.
The threading technique achieves improvement with respect to the unthreaded case, and this improvement increases as the matrix size increases.It achieves up to 3.1 times improvement as compared to the unthreaded case at the size of 8000.At the size of 1000, the improvement reaches its maximum value at number of threads ranging from 8 to 16 which is similar to the cases where α = β.As the number of threads increases over the aforementioned range the  improvement gradually declines and the performance worsens as compared to the unthreaded case when the number of threads exceeds 96.It reaches up to 5 times higher delay as compared to the unthreaded case, while the CPU usage is higher as shown in Fig 6 .Any processing overhead with no performance gain should be avoided as it results in wasted power consumption.It can be noticed that threading has higher improvement for the cases where α = β as compared to those cases where α 6 ¼ β.In the former case, the coefficients in P2 and P4 are a mirror of those in P1 and P3 respectively.Thus, as soon as a coefficient in P1 and P3 is calculated, its mirror is immediately written and to avoid accessing the data again, mitigating memory reads penalties.The number of cores affects the performance improvement of the threading technique.To evaluate this, the instead of enabling all of the eight logical cores as in the previous tests, only four logical cores were enabled.The tests were repeated with the new configuration and Fig 7 shows the effect of number of cores on the maximum improvement.The improvement in the four cores case is lower than that of the eight cores case by up to 48% when α 6 ¼ β (α = 100 and β = 50).For the case where α = β = 10, the four cores show up to 40% lower performance improvement as compared to the eight cores.It should be noted that at the unthreaded case both, the four cores and eight cores, perform similarly.The unthreaded case uses one core only which represents 25% and 12.5% of CPU usage in the 4 cores and 8 cores, respectively.

Conclusion
The paper proposed to exploit the parallelism in coefficient calculations allowing independent threads to process coefficients in parallel to enhance performance.The considered algorithm was analyzed at different matrix sizes, parameters' values, and number of threads.The maximum performance improvement as compared to the unthreaded case was found at the case of α = β ranging from 4.7 at 1000 × 1000 matrix size to 5.8 at 8000 × 8000 matrix size.On the other hand, when α 6 ¼ β, the maximum improvement was at the matrix size of 8000, achieving a maximum improvement of 3.1.The results show that the number of threads should be carefully selected to achieve the optimal improvement, since increasing this number over certain limits may lead to performance degradation due to the threading overheads.Small matrix size (1000 × 1000) reaches its maximum improvement when the number of threads ranges from 8 to 16 whereas the optimum range was from 32 to 160 at large matrix size (8000 × 8000).Reducing the number of cores from 8 to 4 reduces the performance by 48% and 40% for α = β and α 6 ¼ β, respectively Other candidate algorithms exist that it is encouraged to be analyzed when the threading technique is introduced into them which requires identifying the independence within each algorithm in order to be parallelized.

Fig 3 .
Fig 3.The improvement achieved by the proposed algorithm with respect to the unthreaded case for different size values with α = β.https://doi.org/10.1371/journal.pone.0286878.g003